Abstract

In free word order languages, every sentence 
is embedded in its specific context. The order 
of constituents is determined by the categories 
theme, rheme and contrastive focus. This pa- 
per shows how to recognise and to translate 
these categories automatically on a senten- 
tial basis, so that sentence embedding can be 
achieved without having to refer to the con- 
text. Traditionally neglected modifier classes 
are fully covered by the proposed method. 



1 Introduction

Most languages known as free word order languages are in fact
languages with partially free word order (Engelkamp et al. 1992),
or rather free phrase order (Schaufele 1991). A difficulty linked
to the formal description of these languages is that, instead of
a complete lack of ordering rules, many subtle restrictions apply.
A large number of word order variations are grammatical in isolated
sentences, but context restricts the number of sequences which are
possible and natural. In this sense, sentences are embedded in their
context. A specific context calls for a certain word order, and the
word order of a given sentence reflects its context.

In this paper, we present recent suggestions on how to treat free
phrase order in Natural Language Processing (NLP), and present an
alternative solution to the problem. The idea is to use a
'thematically-tagged', or flexible, canonical form (CF) for
generation, and an algorithm to recognise the relevant categories
theme, rheme and contrastive focus during analysis. This method has
been implemented successfully in the unification and constraint-based
Machine Translation system CAT2 (Sharp 1989, Steinberger 1992a). It
includes the ordering of modifiers, which are traditionally left out
in word order description (Conlon/Evens 1992). All statements in this
paper concern written language, as spoken language is more liberal
with respect to ordering.

2 The Data


We shall start by presenting some data which 
illustrates the problems related to word order 
treatment in NLP. Many ordering variations 
are possible (1a - 1e, 2a, 2b), but some of them
are less natural (1e), and others are even ungrammatical
(2c, 2d). 1e is only acceptable if
the personal pronoun ich is heavily stressed,
indicated here in capitals. 1

1a Morgen26 werde ich ihn vielleicht12 besuchen.
Tomorrow will I him probably visit

1b Ich werde ihn vielleicht12 morgen26 besuchen.
I will him probably tomorrow visit

1c Ich werde ihn morgen26 vielleicht12 besuchen.
I will him tomorrow probably visit

1d Vielleicht12 werde ich ihn morgen26 besuchen.
Probably will I him tomorrow visit

1e ? Morgen26 werde ihn vielleicht12 ICH besuchen.
Tomorrow will him probably I visit

2a Er fuhr dennoch20 ebenfalls35 nach München.
He drove nevertheless also to Munich

2b Dennoch20 fuhr er ebenfalls35 nach München.
Nevertheless drove he also to Munich

2c * Er fuhr ebenfalls35 dennoch20 nach München.
He drove also nevertheless to Munich

2d * Ebenfalls35 fuhr er dennoch20 nach München.
Also drove he nevertheless to Munich

1 The use of the index numbers will be explained in
section 5.

Depending on the context, different word 
orders are either required or, at the very least, 
they are more natural than others. Although 
in 3 and 4 the context is represented by ques- 
tions, it is not normally limited to these. 3a, 
which is the most natural answer to 3, is very 
unnatural, if not ungrammatical, in 4. Al- 
though not all contexts restrict the order of 
constituents as drastically as 3 and 4, it is a 
general rule for German and similar languages 
that sentences are more natural if they are 
properly embedded in their contexts: 

3 Wen erwartete die Frau mit dem Nudelholz? 
Whom waited-for the woman with the rolling pin 

3a Die Frau erwartete mit dem Nudelholz ihren 
MANN. 

The woman waited-for with the rolling pin her 
husband 

3b ? Die Frau erwartete ihren MANN mit dem 
Nudelholz. 

The woman waited-for her husband with the 
rolling pin 

4 Mit was erwartete die Frau ihren Mann? 
With-what waited-for the woman her husband 

4a Die Frau erwartete ihren Mann mit dem NUdel- 
holz. 

The woman waited-for her husband with the 
rolling pin 

4b ?? Die Frau erwartete mit dem NUdelholz ihren 
Mann. 

The woman waited-for with the rolling pin her 
husband 

It is generally acknowledged that the com- 
bination of several factors determines the or- 
der of constituents in German and similar lan- 
guages. In Steinberger (1994), eleven princi- 
ples acting on the pragmatic, semantic and 
syntactic levels are listed, each of which can 
be reformulated as one or several linear precedence
(LP) rules. The factors comprise
the tendencies to order elements according to
the theme-rheme structure and/or to the functional
sentence perspective. Furthermore, they
concern verb bonding, animacy, heaviness, the
importance of semantic roles for phrase ordering,
and others. A distinctive feature of the ordering
regularities is that none of the factors can
be formulated as an absolute LP rule, which
makes word order description difficult to deal
with in NLP. In recent years several proposals
have been made to deal with this phenomenon
in either analysis or generation, or both.

3 Recent Suggestions on Treating Free Phrase Order

Uszkoreit (1987) suggests overcoming the lack 
of absolute rules by using disjunctions of LP 
rules. The idea is that if at least one LP 
rule sanctions a sequence of constituents, the 
sentence is grammatical. The model thus ex- 
presses competence, rather than performance, 
as it either accepts or rejects a sentence, with- 
out making a judgement on acceptability dif- 
ferences as in 1. 

Another idea put forward by Erbach 
(1993) accounts for grades of acceptability. Er- 
bach assumes that the order of verb comple- 
ments ideally is according to an obliqueness 
hierarchy, and that each deviation from this or- 
der decreases the acceptability of the sentence 
by a factor of 0.8. Two divergences result in 
an acceptability score of 0.64 (0.8 * 0.8), etc. 
Problems we see linked to this approach are 
the use of the obliqueness hierarchy, which lim- 
its the preference mechanism to complements, 
and the fact that every deviation decreases the
score by the same amount, without considering the
varying effect of different variations.
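
The multiplicative penalty can be illustrated with a minimal sketch. Only the factor 0.8 and the resulting score of 0.64 come from the description above; the function name and its interface are assumptions made for illustration, and how deviations from the obliqueness hierarchy are counted is deliberately left to the caller.

```python
def acceptability(deviations: int, factor: float = 0.8) -> float:
    """Score for a complement order that departs `deviations` times
    from the ideal (obliqueness-based) order. Hypothetical helper."""
    if deviations < 0:
        raise ValueError("deviations must be non-negative")
    # Every deviation multiplies the score by the same constant factor.
    return factor ** deviations


if __name__ == "__main__":
    for n in range(4):
        print(n, round(acceptability(n), 3))
```

The sketch makes the criticism visible: the score depends only on the number of deviations, not on which constituents were permuted.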

A proposal which takes into account the 
different importance, or weight, of preference 
rules, is presented in Jacobs (1988). Jacobs 
assigns each of his preference rules a specific 
numerical weight. If a rule applies in a given 
sentence, its value is added to the acceptability 
score of the sentence; if it is violated, its value
is subtracted. The higher the final score, the
more natural, or the 'better', the sentence is.
Ideally, all competing preference rules are satisfied.
The complication we see with this approach
is that some strictly ordered sequences
interfere with the calculation of acceptability.
Some of them concern the ordering of toners
(Abtönungspartikeln; Thurmair 1989) and
other modifier subgroups (Steinberger 1994).
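
A weighted scoring scheme of this kind might be sketched as follows. The two rules and their weights are invented purely for illustration and are not Jacobs' actual rules or values; only the add-if-satisfied, subtract-if-violated mechanism is taken from the description above.

```python
def precedes_all(seq, is_first, is_second):
    """True if every `is_first` item precedes every `is_second` item,
    False if some pair is inverted, None if the rule is not relevant."""
    firsts = [i for i, c in enumerate(seq) if is_first(c)]
    seconds = [i for i, c in enumerate(seq) if is_second(c)]
    if not firsts or not seconds:
        return None
    return max(firsts) < min(seconds)


# Hypothetical rules and weights, for illustration only.
RULES = [
    ("pronoun < full NP", 2.0,
     lambda s: precedes_all(s, lambda c: c["pron"], lambda c: not c["pron"])),
    ("definite < indefinite", 1.0,
     lambda s: precedes_all(s, lambda c: c["definite"],
                            lambda c: not c["definite"])),
]


def score(seq, rules=RULES):
    """Add the weight of each satisfied rule, subtract that of each violated one."""
    total = 0.0
    for _name, weight, check in rules:
        outcome = check(seq)
        if outcome is True:
            total += weight
        elif outcome is False:
            total -= weight
    return total


ihn = {"form": "ihn", "pron": True, "definite": True}
brief = {"form": "einen Brief", "pron": False, "definite": False}
print(score([ihn, brief]), score([brief, ihn]))
```

Because every rule is merely a weighted preference, nothing in such a scheme can rule out a sequence outright, which is where strictly ordered modifier sequences become a problem.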

Some of the criticism could be overcome 
by changing the different propositions slightly. 
For instance, Erbach's (1993) suggestion to 
add preference to feature-based formalisms 
could be combined with Uszkoreit's preference 
rules. An idea to solve the problems linked 
to Jacobs' weighting mechanism would be to
combine it with absolute LP rules, in order to 
avoid ungrammatical sequences. However, we 
want to suggest another method, based on our 
findings concerning natural, marked and un- 
grammatical word order, and making use of 
the categories theme, rheme and contrastive 
focus (henceforth simply called focus). 

4 The New Model 

In our approach (cf. Steinberger 1994), we 
have different ways of dealing with free phrase 
order in analysis and generation. In analysis 
(cf. section 6), grammars have to allow most 
orderings, as barely any phrase order can be 
completely excluded. Once a structure is as- 
signed to an input sentence, we suggest that 
thematic, rhematic and contrastively focussed 
elements be identified by using our insights 
concerning the recognition of these categories. 
This information concerning functional sen- 
tence perspective can and should be conveyed 
in the target language of the translation. 

With respect to generation (cf. section 
5), acceptable orderings are defined by a sin- 
gle comprehensive linear precedence (LP) rule 
which not only assigns strict priorities to sym- 
bols tagged for syntactic category (e.g. N for 
nominative NP, SIT for situative complement, 
M for modifier), but also for the thematic categories
theme, rheme and contrastive focus. It
is crucial that the relative ordering of syntactic
symbols can be varied by varying their respective
thematic markings. The LP rule also
assigns priorities to syntactic categories which
are not thematically marked. Thus, a syntactic
element is assigned a default position if no
thematic information is available, but is moved 
out of this default position if thematic infor- 
mation is present. In this way, a single rule 
represents a fixed canonical form for unmarked 
elements and at the same time permits widely 
varying (though not truly free) orderings for 
thematically marked cases. 
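
The interaction of default positions and thematic marking can be sketched in a few lines. The slot sequence below is a hypothetical, drastically reduced stand-in for the comprehensive LP rule, chosen only to show the mechanism; the class and function names are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reduced slot sequence; NOT the full canonical form.
SLOTS = ["THEME", "N", "A", "D", "M", "RHEME", "PO", "FOCUS", "SIT"]


@dataclass
class Constituent:
    form: str
    cat: str                    # syntactic tag such as "N", "A", "M"
    mark: Optional[str] = None  # "THEME", "RHEME", "FOCUS" or None
    index: int = 0              # modifier position class, 0 otherwise


def slot_key(c: Constituent):
    # A thematic marking overrides the default slot of the category.
    slot = c.mark if c.mark else c.cat
    return (SLOTS.index(slot), c.index)


def linearise(constituents):
    """Impose a linear order on an unordered collection of constituents."""
    return [c.form for c in sorted(constituents, key=slot_key)]


ich = Constituent("ich", "N")
mann = Constituent("den Mann", "A")
gestern = Constituent("gestern", "M", index=26)
print(linearise([gestern, mann, ich]))
```

Unmarked input yields the default order; marking a single constituent as theme or rheme moves only that constituent, while the rest keeps its fixed relative order. Verb placement is ignored in this sketch.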

Generation and analysis according to this 
method will be presented in more detail now. 

5 Generation 

We argue in Steinberger (1994) that the use 
of a comprehensive LP rule, as presented in 
the previous section, is an efficient way of gen- 
erating sentences which not only are correct 
in some contexts but which comply with their 
contextual restrictions. This flexible output is 
achieved by using the three thematic categories
theme, rheme and contrastive focus, which can
capture complements as well as modifiers realised
by all phrasal categories. Table 1 shows
such a CF for German.

The table is to be read from left to right 
and from top to bottom. The letters N, A, 
D, G represent the four cases nominative, ac- 
cusative, dative, and genitive. PO stands for 
prepositional object, and SIT, DIR and EXP 
for situative, directional and expansive com- 
plements. Nom and Adj are nominal and ad- 
jectival complements, M represents the diverse 
groups of modifiers. The feature +/-d refers to
definiteness, +/-a to animacy, SVC to support
verb constructions, and the index numbers on
M indicate the relative order of modifiers (M1
precedes M2, and so on). The index numbers
are based on Hoberg's classification (1981). If
elements cannot cooccur, they are separated
by a slash (/) rather than by the precedence sign (<).

The CF imposes linear order on an un- 
ordered set of arguments and modifiers. When 
the analysis of the source language fails to 
recognise theme, rheme and focus, a default 
order is generated. Although no CF sequence 
can produce good sentences in all contexts (cf. 
3 and 4), the default order is suitable in a large
number of contexts.
N_{pron} / N_{+d+a} < (A<D/Nom/Adj)_{pron} < THEME < N_{+d-a} / N_{-d+a} <
< (N_{pron} / N_{+d+a})_{+focus} / (A<D)_{pron+focus} < (A<D)_{+d+a} < G_{pron} < N_{-d-a} < (A<D)_{+d-a} <
< M_{pragm(a1-18)} < M_{sit(a19-40)} < M_{neg(41)} < M_{mod(42-43)} <
< (N_{+d-a} / N_{-d} / (A<D)_{+d} / G_{pron} / M_{(a1-43)})_{+rheme} < (A<D)_{-d+a} < M_{mod(44)} <
< PO_{pron} < (A<D)_{-d-a} < PO_{+d+a} < PO_{+d-a} < PO_{-d+a} < PO_{-d-a} < G_{nom} <
< (A/D/G/PO/N_{+d-a}/N_{-d}/M_{pragm}/M_{sit}/M_{mod})_{+focus} <
< SIT/DIR/EXP < (Nom/Adj)_{-pron} < (N/A/D/G/PO)_{SVC}
Table 1: 'Thematically-tagged' Canonical Form for German



Before showing some example sentences 
generated by this CF, we have to mention 
one particularity of German, which is that the 
verb is in second position in declarative ma- 
trix clauses (verb-second, or V2 position), and 
in final position in subordinate clauses (verb- 
final, or VF position). Nearly any element can 
take the one position preceding the verb in V2, 
called the Vorfeld ("pre-(verbal) field"). Normally
a thematic element is placed into the
Vorfeld. According to Hoberg's (1981) analy- 
sis of the Mannheimer Duden Korpus, in 63% 
of all V2 sentences the nominative complement 
(subject) takes this place. A convenient way of 
seeing it is that all elements follow the verb in 
V2 position according to the CF, and that one 
(thematic) element is moved into the Vorfeld 
position. We suggest that if the analysis of the 
source language fails to recognise the theme of 
the sentence, the subject takes this place. 
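
The Vorfeld placement just described can be sketched as a post-processing step over a list of constituents that is already in CF order. The function name, the dictionary keys and the handling of the non-finite verb part are assumptions made for illustration; capitalisation is ignored.

```python
def verb_second(constituents, finite_verb, verb_rest=""):
    """Build a V2 clause from constituents given in CF order.

    Each constituent is a dict with 'form', 'cat' and an optional
    'theme' flag (hypothetical representation).
    """
    # A thematic element is moved into the Vorfeld if there is one.
    vorfeld = next((c for c in constituents if c.get("theme")), None)
    if vorfeld is None:
        # Fallback suggested in the text: the subject fills the Vorfeld.
        vorfeld = next((c for c in constituents if c["cat"] == "N"), None)
    if vorfeld is None:
        raise ValueError("no theme and no subject available for the Vorfeld")
    # All remaining elements follow the finite verb in CF order.
    middle = [c["form"] for c in constituents if c is not vorfeld]
    words = [vorfeld["form"], finite_verb, *middle]
    if verb_rest:
        words.append(verb_rest)
    return " ".join(words)


cf_default = [
    {"form": "ich", "cat": "N"},
    {"form": "den Mann", "cat": "A"},
    {"form": "gestern", "cat": "M"},
]
print(verb_second(cf_default, "habe", "gesehen"))
```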

In our model, most elements can either be 
thematic, rhematic, or neutral (i.e. unmarked 
with respect to theme and rheme). Sentence 
variations as different as shown in the exam- 
ples 5a to 5d can be generated using the canon- 
ical form presented above, depending on the 
parameterisation of the features theme, rheme 
and focus for the different constituents. The 
order of elements in 5a corresponds to the de- 
fault order. However, the same order would be 
generated if the personal pronoun was marked 
as being thematic, and/or if the adverb gestern 
was rhematic. We put the information +theme 
in 5a to 5c in brackets to indicate that this
feature is not a requirement to generate the
respective word orders. The relative order of 
the adverb and the accusative NP in 5b dif- 
fers from the one in 5a, because the object den 
Mann is rhematic. In 5c and 5d, gestern and 
den Mann are thematic, respectively. In ad- 
dition to this, the personal pronoun in 5d is 
marked as being stressed contrastively. We 
used capital letters to express the obligatory 
focus. It is easy to think of more phrase order 
combinations caused by further parameterisa- 
tions. 

5a Ich(+theme) habe den Mann gestern26(+rheme) gesehen. (A+d+a - M26)
I have the man yesterday seen

5b Ich(+theme) habe gestern26 den Mann+rheme gesehen.
I have yesterday the man seen

5c Gestern26+theme habe ich den Mann(+rheme) gesehen.
Yesterday have I the man seen

5d weil gestern26+theme ICH+focus den Mann gesehen habe.
... because yesterday I the man seen have.

Modifiers should be classified according to 
Hoberg's (1981) 44 modifier position classes, 
which partly coincide with the common seman- 
tic classifications, and partly not. Hoberg's 
modifier indexes are the result of the statistical 
verification of Engel's intuitive classes (1970). 
As modifiers do not always follow in the same
order, Hoberg chose a classification which led
to the fewest deviations between her classification
and the order in the corpus used (Mannheimer
Duden Korpus). The following sentences exemplify
the order of the CF for modifiers:

6a Ich habe deshalb22 gestern26 mit Wolf42 ferngesehen.
I have therefore yesterday with Wolf watched-tv

6b Ich habe deshalb22 mit Wolf42 gestern26+rheme ferngesehen.
I have therefore with Wolf yesterday watched-tv

7 Damals26+theme bin ich Frauen ohnehin9 oft37 überstürzt43 davongelaufen.
Then am I women anyway often overhastily ran-away
(Then, I often ran away from women overhastily anyway)

Due to the procedure described in this sec- 
tion, ungrammatical sentences such as 2c and 
2d can be avoided successfully. 
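
For the modifier field alone, the procedure amounts to sorting by position class and displacing a rhematic modifier. The sketch below is a simplification under stated assumptions: the function name and the triple representation are hypothetical, and the rhematic modifier is simply placed last within the modifier sequence, which glosses over the exact rheme slot of the full CF. The index values are those used in examples 2 and 6.

```python
def order_modifiers(modifiers):
    """Order modifiers given as (form, position_class, is_rheme) triples."""
    # Unmarked modifiers follow their position class index.
    unmarked = sorted((m for m in modifiers if not m[2]), key=lambda m: m[1])
    # A rhematic modifier leaves its default position (simplified: goes last).
    rhematic = [m for m in modifiers if m[2]]
    return [m[0] for m in unmarked + rhematic]


print(order_modifiers([("ebenfalls", 35, False), ("dennoch", 20, False)]))
print(order_modifiers([("gestern", 26, True), ("mit Wolf", 42, False),
                       ("deshalb", 22, False)]))
```

Since the generator never emits ebenfalls35 before dennoch20 unless a thematic marking licenses it, sequences such as 2c do not arise.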

6 Analysis 

The generation of contextually embedded sen- 
tences is based on the successful analysis of 
theme and rheme constituents. The recogni- 
tion of contrastive stress is even more impor- 
tant. A basic fact that can be used for the au- 
tomatic recognition of these categories is that 
not only the context determines the ordering 
of constituents in an embedded sentence, but 
also a given sentence carries information on 
the context to which it belongs. When Ger- 
man native speakers see the sentence 3a/4b, 
for instance, they have a strong feeling about 
the context in which it occurs. It is very likely 
that the NP ihren Mann is stressed. It is either
rhematic, or it carries contrastive focus. 1e is
even more restricted. The personal pronoun
ich must be contrastively stressed (I myself am
the person who visits him). In every context
requiring another stress, 1e is ungrammatical.
It is thus possible to extract information on 
the context of a given sentence without having 
access to the preceding sentences. 

Analysis grammars must allow most con- 
stituent order variations, as the number of 
phrase orders that can be excluded is very lim- 
ited. The difference with generation grammars 
is that it is sufficient to generate one 'good' 
phrase order for each context, whereas in anal- 
ysis all possible variations have to be allowed. 



For this reason, the CF is of no use for analysis.
Instead, analysis grammars should allow
all grammatical orders and identify thematic,
rhematic and focussed phrases.

In our algorithm, the number of possi- 
ble themes and rhemes is limited to one con- 
stituent each, as this is sufficient to generate 
the variations in 5 to 7. Firstly, focus should 
be identified, and after this theme and rheme. 
Some permutations are only possible if one 
constituent is stressed contrastively. These 
constructions include the Vorfeld position of 
some typically rhematic elements (8, 9), the 
right movement of constituents which have a 
strong tendency to the left (cf. 1e and 5d
above), and others (Steinberger 1994).

8 Nach FRANKreich+focus ist Vahe geflogen.
To France is Vahe flew (Vahe flew to France)

9 Einen INder+focus hat Anne geheiratet.
An Indian has Anne married (Anne has married an Indian)

In the next step, the theme category is
identified. Every element at the beginning of
the clause is marked as theme if it has not
been identified as focus in the preceding step
(10, 11):

10 Damals+theme lebte Hendrix noch.
Then lived Hendrix still (Hendrix was still alive then)

11 Ich glaube, daß Tina+theme oft kocht.
I believe that Tina often cooks

Similar to Hajičová et al.'s (1993) suggestion
for English, and to Matsubara et al.'s
(1993) for Japanese, the last constituent of
the sentence will be recognised as rhematic, as
rhemes tend to occur sentence-finally (cf. 5a
and 6b). Our approach differs from Hajičová
et al.'s, however, in that we prohibit some elements
from being rhematic. In German, these
inherently non-rhematic elements include per- 
sonal pronouns, as well as a limited set of 
modifiers such as wohl in 12. Although some 
modifier groups tend to be potential rhemes, 
and others do not, most modifiers must be 
coded individually in the dictionary (Stein- 
berger, 1994). Note that if inherently non- 
rhematic elements occur sentence-finally, it is 
likely that either the verb in V2 position or
the Vorfeld element carries heavy stress (12a vs.
12b).

12a Er LAS+focus den Artikel über Wortstellung dann wohl-rheme.
He read the article on word-order then presumably

12b ?? Er las den ArTIkel über Wortstellung dann wohl-rheme.

Hajičová et al. (1993) suggest that verbs
are generally marked as rhemes, except if they 
have very general lexical meaning (such as be, 
have, happen, carry out, become). As our main 
concern is word order, and German verb place- 
ment is restricted by rules which do not al- 
low variation, our algorithm does not allow 
the recognition of verbs as rhemes. In 12, no 
constituent would be recognised as being rhe- 
matic. 
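
The recognition steps of this section can be summarised in a minimal sketch. It assumes that contrastive focus has already been detected from the marked permutations and is passed in as a flag; the function name, the dictionary representation and the pronoun list are assumptions, and wohl is the only non-rhematic modifier taken from the text.

```python
# Partial, illustrative lists; a real system would code these in the dictionary.
PERSONAL_PRONOUNS = {"ich", "du", "er", "sie", "es", "wir", "ihr", "mich",
                     "dich", "ihn", "uns", "euch", "mir", "dir", "ihm", "ihnen"}
NON_RHEMATIC_MODIFIERS = {"wohl"}


def recognise(clause):
    """Tag at most one focus, one theme and one rheme in a clause.

    `clause` is a list of dicts in surface order with 'form', 'cat'
    ('V' for verbal elements) and an optional 'focus' flag.
    """
    nonverbal = [c for c in clause if c["cat"] != "V"]
    # Step 1: focus (assumed to be detected beforehand).
    focus = next((c["form"] for c in nonverbal if c.get("focus")), None)
    theme = rheme = None
    # Step 2: the clause-initial element is the theme unless it is the focus.
    if clause and clause[0]["cat"] != "V" and not clause[0].get("focus"):
        theme = clause[0]["form"]
    # Step 3: the last constituent is the rheme; verbs and inherently
    # non-rhematic elements are never recognised as rhemes.
    if nonverbal:
        last = nonverbal[-1]
        blocked = (last["form"].lower() in PERSONAL_PRONOUNS
                   or last["form"].lower() in NON_RHEMATIC_MODIFIERS
                   or last.get("focus")
                   or last["form"] == theme)
        if not blocked:
            rheme = last["form"]
    return {"focus": focus, "theme": theme, "rheme": rheme}


s5b = [{"form": "Ich", "cat": "N"}, {"form": "habe", "cat": "V"},
       {"form": "gestern", "cat": "M"}, {"form": "den Mann", "cat": "A"},
       {"form": "gesehen", "cat": "V"}]
print(recognise(s5b))
```

For 12, where wohl is sentence-final, the same function returns no rheme, in line with the behaviour described above.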

Not all languages express theme, rheme 
and focus as distinctly by word order vari- 
ation as German does. Either they rely on 
the context to find out which constituents 
(have to) carry stress, or they use other means 
such as clefting, pseudo-clefting, topicalisa- 
tion, dislocation, voice, impersonal construc- 
tions, particles, and morphological as well as 
lexical means (Foley/Van Valin 1985). However,
even in English, which is often referred
to as a fixed word order language, information
on theme and rheme can be extracted automatically
(Hajičová et al. 1993; Steinberger
1992a). To which degree this information is
conveyed in other languages, and by which 
means, must be subject to a language pair- 
specific investigation. The extraction of infor- 
mation on theme, rheme and focus is more im- 
portant when translating from one free phrase 
order language into another, than when trans- 
lating into a fixed-word order language. How- 
ever, there are independent reasons for recog- 
nising the sentence focus, namely the correla- 
tion between stress on the one hand, and scope 
of negation (Payne 1985) and of degree modi- 
fiers (Steinberger 1992b) on the other. 



7 Ambiguity Resolution 

Findings on natural, less natural and ungram- 
matical word order variations can also be used 
to improve sentence analysis with respect to 
some cases of ambiguity resolution. In the 
case of 13, eher can be recognised as denoting
earlier (eher26), as the homonymous adverb
(eher5, "rather") must not be negated. Furthermore,
some cases of unlikely PP attachment
can be nearly excluded. In 14, the PP
expressing location (vor der Bank) is unlikely 
to be a sentence modifier, as this would result 
in contrastive focussing of the personal pro- 
noun ihn. This can be seen in 15, where the 
PP cannot be an adjunct to the preceding NP, 
because the NP is realised as a pronoun. The 
PP in 14 is thus more likely to be an adjunct to 
the nominative NP der Mann (14a) than a sen- 
tence modifier (14b). The general principle is 
that focussing constructions are relatively un- 
likely to occur in written text, and therefore 
one should avoid the analysis involving focus 
when another analysis is possible. This is the 
case when the analysis of the PP as an adjunct 
results in a sentence without contrastive stress. 

13a Er sollte nicht eher26 kommen. (not earlier) 

He should not earlier come (He should not come 
earlier) 

13b * Er sollte nicht eher5 kommen. (rather)
He should not rather come

14 Deshalb hat der Mann vor der Bank ihn gesehen.
Therefore has the man in-front-of the bank him seen
(Therefore the man in front of the bank has seen him)

14a Deshalb hat der Mann vor der Bank ihn gesehen.

14b ? Deshalb hat der Mann vor der Bank IHN ignoriert.

15 ?? Deshalb hat er vor der Bank IHN gesehen.
Therefore has he in-front-of the bank him seen
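
The general principle, avoiding an analysis that involves contrastive focus whenever another analysis is available, can be sketched as a simple filter over competing analyses. The function name and the representation of an analysis as a dictionary with a `requires_focus` flag are assumptions made for illustration.

```python
def prefer_unfocussed(analyses):
    """Keep the analyses that need no contrastive focus, if there are any;
    otherwise return all analyses unchanged."""
    unfocussed = [a for a in analyses if not a["requires_focus"]]
    return unfocussed or list(analyses)


# The two competing readings of sentence 14.
readings_14 = [
    {"label": "PP is an adjunct to the NP 'der Mann' (14a)",
     "requires_focus": False},
    {"label": "PP is a sentence modifier (14b)",
     "requires_focus": True},
]
for reading in prefer_unfocussed(readings_14):
    print(reading["label"])
```

If every available reading requires focus, as in 15, nothing is discarded, so the filter expresses a preference rather than a hard constraint.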

8 Conclusion 

The order of constituents in free phrase or- 
der languages is determined by a set of fac- 
tors which constitute tendencies rather than 
clear-cut rules. The fact that most, but not all, 
constituent orders are possible, and that some 
orders are more natural than others poses a
considerable problem for NLP. 

In this paper, we presented a method to 
deal with these problems from the analysis and 
the generation point of view. Concerning anal- 
ysis, the main idea is that single sentences re- 
flect the theme-rheme structure imposed by 
the context, so that thematic, rhematic and 
(contrastively) focussed constituents can often 
be recognised. In generation, we can convey
this knowledge by varying word order depending
on the context. This is achieved by using a
canonical form which includes the flexible categories
theme, rheme and contrastive focus.

A major advantage over methods sug- 
gested in the past is that acceptability differ- 
ences between sentences can be dealt with, and 
that even modifier sequences, which are tra- 
ditionally left out in word order description, 
can be handled. Wrong constituent orders are 
avoided, because the order of the major part 
of the sentence is fixed, and only single con- 
stituents move to the theme and rheme posi- 
tions. The difficulty arising from the unclear 
borderline between free and fixed phrase or- 
der, which is typical of most free phrase order 
languages, is dealt with successfully. 